Academic interest. Trying to build a classification model to predict individuals who experienced 90-days-past-due delinquency or worse. The data provided has 10 variables, all appropriate for predicting Probability of Default (PD).

Variable List:

Dependent Variable:

SeriousDlqin2yrs (Y): Person experienced 90 days past due delinquency or worse. Y/N

Independent Variables:

RevolvingUtilizationOfUnsecuredLines (x1): Total balance on credit cards and personal lines of credit (excluding real estate and installment debt such as car loans) divided by the sum of credit limits. percentage
age (x2): Age of borrower in years. integer
NumberOfTime30-59DaysPastDueNotWorse (x3): Number of times borrower has been 30-59 days past due but no worse in the last 2 years. integer
DebtRatio (x4): Monthly debt payments, alimony and living costs divided by monthly gross income. percentage
MonthlyIncome (x5): Monthly income. real
NumberOfOpenCreditLinesAndLoans (x6): Number of open loans (installment, e.g. car loan or mortgage) and lines of credit (e.g. credit cards). integer
NumberOfTimes90DaysLate (x7): Number of times borrower has been 90 days or more past due. integer
NumberRealEstateLoansOrLines (x8): Number of mortgage and real estate loans, including home equity lines of credit. integer
NumberOfTime60-89DaysPastDueNotWorse (x9): Number of times borrower has been 60-89 days past due but no worse in the last 2 years. integer
NumberOfDependents (x10): Number of dependents in family, excluding themselves (spouse, children etc.). integer

Libraries

## Warning: package 'ggplot2' was built under R version 3.2.3

Load Data

train <- read.csv("cs-training.csv", header = TRUE, sep = ",", row.names = 1)
test  <- read.csv("cs-test.csv", header = TRUE, sep = ",", row.names = 1)
# Flag each row's source before stacking, so the sets can be split back out later
train$flg <- "Y"
test$flg <- "N"
data <- rbind(train, test)
# Short names: y = SeriousDlqin2yrs, x1..x10 as in the variable list above
names(data)  <- c('y','x1','x2','x3','x4','x5','x6','x7','x8','x9','x10','flg')
names(train) <- c('y','x1','x2','x3','x4','x5','x6','x7','x8','x9','x10','flg')
names(test)  <- c('y','x1','x2','x3','x4','x5','x6','x7','x8','x9','x10','flg')
dim(data)
## [1] 251503     12

Univariate Exploration

x1: RevolvingUtilizationOfUnsecuredLines

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00     0.03     0.15     6.05     0.56 50710.00

The data suggests that anything with a ratio > 1 has a very high density of bads, and that the proportion of bads rises as the ratio approaches 1 from below. Based on this, everything with ratio > 1 will be treated as a special case; the rest will be converted into decile ranges.

x2: Age

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    41.0    52.0    52.3    63.0   109.0

The age range is quite large. For now no clean-up is planned.

x3: NumberOfTime30-59DaysPastDueNotWorse

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.0000  0.0000  0.0000  0.2436  0.0000 13.0000

There seem to be about 269 records with the value 98 in the training data set (plotted as -1 for better clarity in the plot and summary above). This may be a case of an existing back-end field being reused for some other purpose. Since this is an integer field, it can be converted into an ordered factor before modeling.
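A sketch of the ordered-factor conversion described above. The cutoff of 96 for the sentinel codes and the `x3lab` column name are my assumptions, not taken from the source:

```r
# Sketch: keep the 98-style codes as their own level, order the genuine counts.
x3chr <- ifelse(train$x3 >= 96, "code", as.character(train$x3))
train$x3lab <- factor(x3chr,
                      levels = c(as.character(0:13), "code"),
                      ordered = TRUE)
table(train$x3lab)
```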

x4: DebtRatio

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      0.0      0.2      0.4    353.0      0.9 329700.0

##      
##       Not Null   Null
##   In    113036   1827
##   Out     7233  27904

Usually debt-to-income ratios should be smaller than 1. In general, depending on the type of asset product and the economic stress cycle, lenders operate within a 0.35 to 0.45 debt-to-income ratio. Since monthly income is also available in the data (x5), a decision on what to do with this field will be taken after reviewing that variable.

x5: MonthlyIncome

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0    3400    5400    6670    8249 3009000   29731

The spread of income is quite vast. Of course this is a continuous variable and there is really no limit to what one can earn, but it might make sense to cap this variable and look at it again. There are really two issues: 1. outliers beyond the Q3 + 1.5*IQR range and below the Q1 - 1.5*IQR range; 2. missing values, the NAs will have to be treated or dropped. This needs to be fixed before x4.
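A minimal sketch of the capping at the Tukey fences and the missingness flag described above. The `x5cap` and `x5na` names are my own:

```r
# Sketch: cap MonthlyIncome (x5) at the Tukey fences and flag missing values.
q     <- quantile(train$x5, c(0.25, 0.75), na.rm = TRUE)
iqr   <- q[2] - q[1]
upper <- q[2] + 1.5 * iqr
lower <- max(q[1] - 1.5 * iqr, 0)   # income cannot be negative
train$x5cap <- pmin(pmax(train$x5, lower), upper)
train$x5na  <- is.na(train$x5)      # keep an explicit missingness flag
```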

x6: NumberOfOpenCreditLinesAndLoans

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   5.000   8.000   8.453  11.000  58.000

There is a certain right-hand skew in the data; 58 loans and lines of credit is rather a lot. This may be converted into quantiles to handle the outliers.
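A sketch of the quantile conversion, using `quantcut` from gtools. Quartiles are used here for illustration; the exact binning is an assumption:

```r
# Sketch: bin x6 into quantile ranges to blunt the right-hand tail.
library(gtools)
train$x6lab <- quantcut(train$x6, q = seq(0, 1, by = 0.25))
table(train$x6lab)
```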

x7: NumberOfTimes90DaysLate

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.0000  0.0000  0.0000  0.0885  0.0000 17.0000

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.266   0.000  98.000

Yet again there is an outlier value of 98, which is perhaps a code. We shall have to decide how to treat these values. One option would be to convert the integer values into an ordered factor and assign 98 to an 'others' level. The maximum value otherwise is 17, in itself a very high value.
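One way to sketch that treatment with `cut`; the bucket boundaries and the `x7lab` name are my own choices:

```r
# Sketch: bucket the counts and send the 98 code to an 'others' level.
train$x7lab <- cut(train$x7,
                   breaks = c(-Inf, 0, 1, 2, 17, Inf),
                   labels = c("0", "1", "2", "3-17", "others"),
                   ordered_result = TRUE)
table(train$x7lab)
```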

x8: NumberRealEstateLoansOrLines

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.000   1.018   2.000  54.000

## 75% 
##   5

This is long tailed. Arguably, as the number of loans goes up, so does the possibility of default.

x9: NumberOfTime60-89DaysPastDueNotWorse

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2404  0.0000 98.0000

There looks to be another incidence of 98. Flag it off and convert the rest into deciles.

x10: NumberOfDependents

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   0.757   1.000  20.000    3924
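x10 has 3,924 missing values. A minimal sketch of one possible treatment, assuming missing can be read as zero dependents while keeping a flag; the `x10na`/`x10f` names are my own:

```r
# Sketch: flag missingness, then impute zero dependents for the NAs.
train$x10na <- is.na(train$x10)
train$x10f  <- ifelse(train$x10na, 0, train$x10)
```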

Bivariates

Pruning

Culling some of the outliers to see if some simple relationships exist.
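A hedged sketch of the kind of culling meant here. The specific filters and thresholds are illustrative assumptions, not the exact ones that produced the counts and summaries below:

```r
# Sketch: drop sentinel codes and extreme values before the bivariate plots.
pruned <- subset(train,
                 x1 <= 1 &                      # utilization ratio within [0, 1]
                 x3 < 96 & x7 < 96 & x9 < 96 &  # drop the 96/98 codes
                 !is.na(x5) &
                 x5 <= quantile(train$x5, 0.95, na.rm = TRUE))
nrow(pruned)
```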

##        y                 x1                 x2              x3         
##  Min.   :0.00000   Min.   :    0.00   Min.   :  0.0   Min.   : 0.0000  
##  1st Qu.:0.00000   1st Qu.:    0.03   1st Qu.: 41.0   1st Qu.: 0.0000  
##  Median :0.00000   Median :    0.15   Median : 52.0   Median : 0.0000  
##  Mean   :0.06684   Mean   :    6.05   Mean   : 52.3   Mean   : 0.4211  
##  3rd Qu.:0.00000   3rd Qu.:    0.56   3rd Qu.: 63.0   3rd Qu.: 0.0000  
##  Max.   :1.00000   Max.   :50708.00   Max.   :109.0   Max.   :98.0000  
##                                                                        
##        x4                 x5                x6               x7        
##  Min.   :     0.0   Min.   :      0   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:     0.2   1st Qu.:   3400   1st Qu.: 5.000   1st Qu.: 0.000  
##  Median :     0.4   Median :   5400   Median : 8.000   Median : 0.000  
##  Mean   :   353.0   Mean   :   6670   Mean   : 8.453   Mean   : 0.266  
##  3rd Qu.:     0.9   3rd Qu.:   8249   3rd Qu.:11.000   3rd Qu.: 0.000  
##  Max.   :329664.0   Max.   :3008750   Max.   :58.000   Max.   :98.000  
##                     NA's   :29731                                      
##        x8               x9               x10             flg           
##  Min.   : 0.000   Min.   : 0.0000   Min.   : 0.000   Length:150000     
##  1st Qu.: 0.000   1st Qu.: 0.0000   1st Qu.: 0.000   Class :character  
##  Median : 1.000   Median : 0.0000   Median : 0.000   Mode  :character  
##  Mean   : 1.018   Mean   : 0.2404   Mean   : 0.757                     
##  3rd Qu.: 2.000   3rd Qu.: 0.0000   3rd Qu.: 1.000                     
##  Max.   :54.000   Max.   :98.0000   Max.   :20.000                     
##                                     NA's   :3924
## [1] 269
## [1] 63725
## [1] 52558
## [1] 97442
##        y                 x1                x2              x3         
##  Min.   :0.00000   Min.   :0.00000   Min.   :29.00   Min.   : 0.0000  
##  1st Qu.:0.00000   1st Qu.:0.03602   1st Qu.:42.00   1st Qu.: 0.0000  
##  Median :0.00000   Median :0.17225   Median :51.00   Median : 0.0000  
##  Mean   :0.06268   Mean   :0.31423   Mean   :51.37   Mean   : 0.2535  
##  3rd Qu.:0.00000   3rd Qu.:0.53553   3rd Qu.:61.00   3rd Qu.: 0.0000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :78.00   Max.   :13.0000  
##        x4                x5              x6               x7          
##  Min.   : 0.0000   Min.   : 1300   Min.   : 0.000   Min.   : 0.00000  
##  1st Qu.: 0.1634   1st Qu.: 3789   1st Qu.: 5.000   1st Qu.: 0.00000  
##  Median : 0.3100   Median : 5592   Median : 8.000   Median : 0.00000  
##  Mean   : 0.3779   Mean   : 6133   Mean   : 9.006   Mean   : 0.08107  
##  3rd Qu.: 0.4817   3rd Qu.: 8047   3rd Qu.:12.000   3rd Qu.: 0.00000  
##  Max.   :95.3009   Max.   :14587   Max.   :58.000   Max.   :17.00000  
##        x8               x9                x10              flg           
##  Min.   : 0.000   Min.   : 0.00000   Min.   : 0.0000   Length:97442      
##  1st Qu.: 0.000   1st Qu.: 0.00000   1st Qu.: 0.0000   Class :character  
##  Median : 1.000   Median : 0.00000   Median : 0.0000   Mode  :character  
##  Mean   : 1.098   Mean   : 0.06082   Mean   : 0.8916                     
##  3rd Qu.: 2.000   3rd Qu.: 0.00000   3rd Qu.: 2.0000                     
##  Max.   :54.000   Max.   :11.00000   Max.   :20.0000
## Warning: Using size for a discrete variable is not advised.


Transformation

# x1: flag utilization > 1 as a special case, decile the rest
library(gtools)
train$x1lab <- 'Ratio > 1'
ok <- train$x1 <= 1
train$x1lab[ok] <- as.character(quantcut(train$x1[ok], q = seq(0, 1, by = 0.1)))
#which(train$x1lab != 'Ratio > 1')

# x5: cap at the 5th and 95th percentiles before reusing it for x4
train$x5 <- ifelse(train$x5 > quantile(train$x5, 0.95, na.rm = TRUE),
                   quantile(train$x5, 0.95, na.rm = TRUE),
            ifelse(train$x5 < quantile(train$x5, 0.05, na.rm = TRUE),
                   quantile(train$x5, 0.05, na.rm = TRUE), train$x5))